{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "fwukZZnNTYWE" }, "source": [ "
\n", "\n", "# Data Acquisition and Preprocessing\n", "\n", "Copyright, NLP from scratch, 2024.\n", "\n", "[NLPfor.me](https://www.nlpfor.me)\n", "\n", "------------" ] }, { "cell_type": "markdown", "metadata": { "id": "wG0CakglfdmV" }, "source": [ "## Data Acquisition\n", "\n", "### Requesting Data From a Web Service with the `requests` library" ] }, { "cell_type": "markdown", "metadata": { "id": "vW18g_6ofdmX" }, "source": [ "In this notebook, we will acquire and preprocess text data from online sources. We have already been introduced to the [requests library](https://requests.readthedocs.io/en/latest/), and we will show how, with a few simple lines of code, we can use it to pull data from a web service (REST API).\n", "\n", "[The Cocktail DB](https://www.thecocktaildb.com/) is an open source database of cocktails and drinks from around the world, and their ingredients. It also has an API that is free to use for educational purposes.\n", "\n", "Let's get some text data using the `requests` library: here, a description of gin. The URL pattern for a given web service is up to its designer, and should be well documented. The Cocktail DB tells us to use the URL pattern `https://www.thecocktaildb.com/api/json/v1/1/search.php?i=` in order to get information back on a drink ingredient.\n", "\n", "First, we import the `requests` library, then simply make a request using the `get` method and the URL:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "tdOoclo3xX1f" }, "outputs": [], "source": [ "# Import the requests library\n", "import requests\n", "\n", "# Make a call to the API\n", "r = requests.get(\"https://www.thecocktaildb.com/api/json/v1/1/search.php?i=gin\")" ] }, { "cell_type": "markdown", "metadata": { "id": "fch1ViDGfdmb" }, "source": [ "Our machine has now made the request and hopefully gotten a response from the server! Let's check the response code." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 29, "status": "ok", "timestamp": 1687981819911, "user": { "displayName": "Myles Harrison", "userId": "02225822412878142066" }, "user_tz": 240 }, "id": "v3j4XGvdxhr1", "outputId": "811a3ebd-ee84-4a7e-ca32-d6d397184f0b" }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Check the response\n", "r" ] }, { "cell_type": "markdown", "metadata": { "id": "MkJ7g3Fifdme" }, "source": [ "We can see we have received a response code of 200, which means \"OK\" and that data was returned successfully. Let's check what was returned from our request. There are two ways to do this: the most straightforward is just to return using the `.text` attribute, which shows the contents of the response as an ordinary python string:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 123 }, "executionInfo": { "elapsed": 25, "status": "ok", "timestamp": 1687981819912, "user": { "displayName": "Myles Harrison", "userId": "02225822412878142066" }, "user_tz": 240 }, "id": "Zvs8PgFuxix5", "outputId": "63bd2faa-b121-404d-fa63-46e8779922a3" }, "outputs": [ { "data": { "application/vnd.google.colaboratory.intrinsic+json": { "type": "string" }, "text/plain": [ "'{\"ingredients\":[{\"idIngredient\":\"2\",\"strIngredient\":\"Gin\",\"strDescription\":\"Gin is a distilled alcoholic drink that derives its predominant flavour from juniper berries (Juniperus communis). Gin is one of the broadest categories of spirits, all of various origins, styles, and flavour profiles, that revolve around juniper as a common ingredient.\\\\r\\\\n\\\\r\\\\nFrom its earliest origins in the Middle Ages, the drink has evolved from a herbal medicine to an object of commerce in the spirits industry. 
Gin emerged in England after the introduction of the jenever, a Dutch liquor which originally had been a medicine. Although this development had been taking place since early 17th century, gin became widespread after the William of Orange-led 1688 Glorious Revolution and subsequent import restrictions on French brandy.\\\\r\\\\n\\\\r\\\\nGin today is produced in subtly different ways, from a wide range of herbal ingredients, giving rise to a number of distinct styles and brands. After juniper, gin tends to be flavoured with botanical\\\\/herbal, spice, floral or fruit-flavours or often a combination. It is most commonly consumed mixed with tonic water. Gin is also often used as a base spirit to produce flavoured gin-based liqueurs such as, for example, Sloe gin, traditionally by the addition of fruit, flavourings and sugar.\",\"strType\":\"Gin\",\"strAlcohol\":\"Yes\",\"strABV\":\"40\"}]}'" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# What are the contents\n", "r.text" ] }, { "cell_type": "markdown", "metadata": { "id": "LBzUfL-Mfdmh" }, "source": [ "We can see there is some nesting of data here, as the response is actually returned in [Javascript Object Notation (JSON) format ](https://en.wikipedia.org/wiki/JSON), or what someone who works in python might instead call a [dictionary](https://docs.python.org/3/tutorial/datastructures.html#dictionaries). 
We can see the response parsed from JSON as well, using the `.json()` method of the response object:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 22, "status": "ok", "timestamp": 1687981819913, "user": { "displayName": "Myles Harrison", "userId": "02225822412878142066" }, "user_tz": 240 }, "id": "6LamGPIxxlFi", "outputId": "01ef4ec3-ccd8-46f6-e7c3-da0ecf42de57" }, "outputs": [ { "data": { "text/plain": [ "{'ingredients': [{'idIngredient': '2',\n", " 'strIngredient': 'Gin',\n", " 'strDescription': 'Gin is a distilled alcoholic drink that derives its predominant flavour from juniper berries (Juniperus communis). Gin is one of the broadest categories of spirits, all of various origins, styles, and flavour profiles, that revolve around juniper as a common ingredient.\\r\\n\\r\\nFrom its earliest origins in the Middle Ages, the drink has evolved from a herbal medicine to an object of commerce in the spirits industry. Gin emerged in England after the introduction of the jenever, a Dutch liquor which originally had been a medicine. Although this development had been taking place since early 17th century, gin became widespread after the William of Orange-led 1688 Glorious Revolution and subsequent import restrictions on French brandy.\\r\\n\\r\\nGin today is produced in subtly different ways, from a wide range of herbal ingredients, giving rise to a number of distinct styles and brands. After juniper, gin tends to be flavoured with botanical/herbal, spice, floral or fruit-flavours or often a combination. It is most commonly consumed mixed with tonic water. 
Gin is also often used as a base spirit to produce flavoured gin-based liqueurs such as, for example, Sloe gin, traditionally by the addition of fruit, flavourings and sugar.',\n", " 'strType': 'Gin',\n", " 'strAlcohol': 'Yes',\n", " 'strABV': '40'}]}" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Parse the JSON response into a Python dict\n", "r.json()" ] }, { "cell_type": "markdown", "metadata": { "id": "9dwqvXPrfdmj" }, "source": [ "Now, to pull out the description, we index into the list associated with the `ingredients` key (there is only one element, element 0) and then get the value associated with the `strDescription` key within it:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 19, "status": "ok", "timestamp": 1687981819914, "user": { "displayName": "Myles Harrison", "userId": "02225822412878142066" }, "user_tz": 240 }, "id": "R7r7BuVifdmj", "outputId": "5232e4c7-ffa4-413d-bd03-e3013f44112f" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Gin is a distilled alcoholic drink that derives its predominant flavour from juniper berries (Juniperus communis). Gin is one of the broadest categories of spirits, all of various origins, styles, and flavour profiles, that revolve around juniper as a common ingredient.\r\n", "\r\n", "From its earliest origins in the Middle Ages, the drink has evolved from a herbal medicine to an object of commerce in the spirits industry. Gin emerged in England after the introduction of the jenever, a Dutch liquor which originally had been a medicine. 
Although this development had been taking place since early 17th century, gin became widespread after the William of Orange-led 1688 Glorious Revolution and subsequent import restrictions on French brandy.\r\n", "\r\n", "Gin today is produced in subtly different ways, from a wide range of herbal ingredients, giving rise to a number of distinct styles and brands. After juniper, gin tends to be flavoured with botanical/herbal, spice, floral or fruit-flavours or often a combination. It is most commonly consumed mixed with tonic water. Gin is also often used as a base spirit to produce flavoured gin-based liqueurs such as, for example, Sloe gin, traditionally by the addition of fruit, flavourings and sugar.\n" ] } ], "source": [ "description = r.json()['ingredients'][0]['strDescription']\n", "print(description)" ] }, { "cell_type": "markdown", "metadata": { "id": "paMiILuufdmk" }, "source": [ "Great! We have successfully retrieved some text from an API using `requests`. We could write more code to retrieve more data programmatically and store it in a data structure such as a list or pandas dataframe to work with in an NLP task:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "sJvcDKHefdmk" }, "outputs": [], "source": [ "# List of ingredients for building request URLs\n", "ingredients = ['gin', 'vodka', 'rum']\n", "\n", "# Empty list to hold descriptions returned from API\n", "description_list = list()\n", "\n", "# Iterate over the ingredients\n", "for ingredient in ingredients:\n", "\n", "    # Make a call to the API\n", "    r = requests.get(f\"https://www.thecocktaildb.com/api/json/v1/1/search.php?i={ingredient}\")\n", "\n", "    # Pull out the description and append to the list\n", "    description = r.json()['ingredients'][0]['strDescription']\n", "    description_list.append({'ingredient': ingredient, 'description': description})" ] }, { "cell_type": "markdown", "metadata": { "id": "dJky8PQyfdml" }, "source": [ "Now we have a list storing the description field 
from data returned from the API calls:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 10, "status": "ok", "timestamp": 1687981820144, "user": { "displayName": "Myles Harrison", "userId": "02225822412878142066" }, "user_tz": 240 }, "id": "uSOPLzQ9fdml", "outputId": "036023a7-e089-4ee0-a258-03ff6b489aa9", "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "[{'ingredient': 'gin',\n", " 'description': 'Gin is a distilled alcoholic drink that derives its predominant flavour from juniper berries (Juniperus communis). Gin is one of the broadest categories of spirits, all of various origins, styles, and flavour profiles, that revolve around juniper as a common ingredient.\\r\\n\\r\\nFrom its earliest origins in the Middle Ages, the drink has evolved from a herbal medicine to an object of commerce in the spirits industry. Gin emerged in England after the introduction of the jenever, a Dutch liquor which originally had been a medicine. Although this development had been taking place since early 17th century, gin became widespread after the William of Orange-led 1688 Glorious Revolution and subsequent import restrictions on French brandy.\\r\\n\\r\\nGin today is produced in subtly different ways, from a wide range of herbal ingredients, giving rise to a number of distinct styles and brands. After juniper, gin tends to be flavoured with botanical/herbal, spice, floral or fruit-flavours or often a combination. It is most commonly consumed mixed with tonic water. Gin is also often used as a base spirit to produce flavoured gin-based liqueurs such as, for example, Sloe gin, traditionally by the addition of fruit, flavourings and sugar.'},\n", " {'ingredient': 'vodka',\n", " 'description': 'Vodka is a distilled beverage composed primarily of water and ethanol, sometimes with traces of impurities and flavorings. 
Traditionally, vodka is made by the distillation of fermented cereal grains or potatoes, though some modern brands use other substances, such as fruits or sugar.\\r\\n\\r\\nSince the 1890s, the standard Polish, Russian, Belarusian, Ukrainian, Estonian, Latvian, Lithuanian and Czech vodkas are 40% alcohol by volume ABV (80 US proof), a percentage that is widely misattributed to Dmitri Mendeleev. The European Union has established a minimum of 37.5% ABV for any \"European vodka\" to be named as such. Products sold as \"vodka\" in the United States must have a minimum alcohol content of 40%. Even with these loose restrictions, most vodka sold contains 40% ABV. For homemade vodkas and distilled beverages referred to as \"moonshine\", see moonshine by country.\\r\\n\\r\\nVodka is traditionally drunk neat (not mixed with any water, ice, or other mixer), though it is often served chilled in the vodka belt countries (Belarus, Estonia, Finland, Iceland, Latvia, Lithuania, Norway, Poland, Russia, Sweden, Ukraine). It is also commonly used in cocktails and mixed drinks, such as the vodka martini, Cosmopolitan, vodka tonic, Screwdriver, Greyhound, Black or White Russian, Moscow Mule, and Bloody Mary.\\r\\n\\r\\nScholars debate the beginnings of vodka. It is a contentious issue because very little historical material is available. For many centuries, beverages differed significantly compared to the vodka of today, as the spirit at that time had a different flavor, color and smell, and was originally used as medicine. It contained little alcohol, an estimated maximum of about 14%, as only this amount can be attained by natural fermentation. The still, allowing for distillation (\"burning of wine\"), increased purity, and increased alcohol content, was invented in the 8th century.\\r\\n\\r\\nA common property of the vodkas produced in the United States and Europe is the extensive use of filtration prior to any additional processing including the addition of flavorants. 
Filtering is sometimes done in the still during distillation, as well as afterwards, where the distilled vodka is filtered through activated charcoal and other media to absorb trace amounts of substances that alter or impart off-flavors to the vodka. However, this is not the case in the traditional vodka-producing nations, so many distillers from these countries prefer to use very accurate distillation but minimal filtering, thus preserving the unique flavors and characteristics of their products.\\r\\n\\r\\nThe master distiller is in charge of distilling the vodka and directing its filtration, which includes the removal of the \"fore-shots\", \"heads\" and \"tails\". These components of the distillate contain flavor compounds such as ethyl acetate and ethyl lactate (heads) as well as the fusel oils (tails) that impact the usually desired clean taste of vodka. Through numerous rounds of distillation, or the use of a fractioning still, the taste is modified and clarity is increased. In contrast, distillery process for liquors such as whiskey, rum, and baijiu allow portions of the \"heads\" and \"tails\" to remain, giving them their unique flavors.\\r\\n\\r\\nRepeated distillation of vodka will make its ethanol level much higher than is acceptable to most end users, whether legislation determines strength limits or not. Depending on the distillation method and the technique of the stillmaster, the final filtered and distilled vodka may have as much as 95–96% ethanol. As such, most vodka is diluted with water prior to bottling.\\r\\n\\r\\nPolish distilleries make a very pure (96%, 192 proof, formerly also 98%) rectified spirit (Polish language: spirytus rektyfikowany). Technically a form of vodka, it is sold in liquor stores rather than pharmacies. Similarly, the German market often carries German, Hungarian, Polish, and Ukrainian-made varieties of vodka of 90 to 95% ABV. A Bulgarian vodka, Balkan 176°, has an 88% alcohol content. 
Everclear, an American brand, is also sold at 95% ABV.'},\n", " {'ingredient': 'rum',\n", " 'description': 'Rum is a distilled alcoholic beverage made from sugarcane byproducts, such as molasses, or directly from sugarcane juice, by a process of fermentation and distillation. The distillate, a clear liquid, is then usually aged in oak barrels.\\r\\n\\r\\nThe majority of the world\\'s rum production occurs in the Caribbean and Latin America. Rum is also produced in Scotland, Austria, Spain, Australia, New Zealand, Fiji, the Philippines, India, Reunion Island, Mauritius, South Africa, Taiwan, Thailand, Japan, the United States, and Canada.\\r\\n\\r\\nRums are produced in various grades. Light rums are commonly used in cocktails, whereas \"golden\" and \"dark\" rums were typically consumed straight or neat, on the rocks, or used for cooking, but are now commonly consumed with mixers. Premium rums are also available, made to be consumed either straight or iced.\\r\\n\\r\\nRum plays a part in the culture of most islands of the West Indies as well as in The Maritimes and Newfoundland. This beverage has famous associations with the Royal Navy (where it was mixed with water or beer to make grog) and piracy (where it was consumed as bumbo). Rum has also served as a popular medium of economic exchange, used to help fund enterprises such as slavery (see Triangular trade), organized crime, and military insurgencies (e.g., the American Revolution and Australia\\'s Rum Rebellion).\\r\\n\\r\\nThe precursors to rum date back to antiquity. Development of fermented drinks produced from sugarcane juice is believed to have first occurred either in ancient India or in China, and to have spread from there. An example of such an early drink is brum. Produced by the Malay people, brum dates back thousands of years. 
Marco Polo also recorded a 14th-century account of a \"very good wine of sugar\" that was offered to him in the area that became modern-day Iran.\\r\\n\\r\\nThe first distillation of rum took place on the sugarcane plantations of the Caribbean in the 17th century. Plantation slaves first discovered molasses, a byproduct of the sugar refining process, could be fermented into alcohol. Later, distillation of these alcoholic byproducts concentrated the alcohol and removed impurities, producing the first true rums. Tradition suggests rum first originated on the island of Barbados. However, in the decade of the 1620s, rum production was recorded in Brazil. A liquid identified as rum has been found in a tin bottle found on the Swedish warship Vasa, which sank in 1628.\\r\\n\\r\\nA 1651 document from Barbados stated, \"The chief fuddling they make in the island is Rumbullion, alias Kill-Divil, and thi is made of sugar canes distilled, a hot, hellish, and terrible liquor.\"'}]" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "description_list" ] }, { "cell_type": "markdown", "metadata": { "id": "7y7oc6Kkfdmm" }, "source": [ "Finally, we can plunk this into a pandas dataframe to make things a bit nicer:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 143 }, "executionInfo": { "elapsed": 596, "status": "ok", "timestamp": 1687981820734, "user": { "displayName": "Myles Harrison", "userId": "02225822412878142066" }, "user_tz": 240 }, "id": "OBGKY6K7fdmm", "outputId": "e26c9553-5409-4825-f279-f76b10164743" }, "outputs": [ { "data": { "text/html": [ "\n", "
\n", " " ], "text/plain": [ " ingredient description\n", "0 gin Gin is a distilled alcoholic drink that derive...\n", "1 vodka Vodka is a distilled beverage composed primari...\n", "2 rum Rum is a distilled alcoholic beverage made fro..." ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Import pandas\n", "import pandas as pd\n", "\n", "# Create a pandas dataframe from the list of dictionaries\n", "# Keys become the column names, values fill each row\n", "desc_df = pd.DataFrame(description_list)\n", "\n", "# Check the output\n", "desc_df.head()" ] }, { "cell_type": "markdown", "metadata": { "id": "GmL7RKb1fdmn" }, "source": [ "## Scraping Web Pages with BeautifulSoup\n", "\n", "Unfortunately, the data we need is not always available for download or from a web service. In some cases we may have to [web scrape](https://en.wikipedia.org/wiki/Web_scraping) data if it is \"locked up\" in pages that are meant to be viewed in the browser.\n", "\n", "Web scraping is a bit of a gray area legally, so, as with free and open APIs, be respectful of whoever is hosting the content and associated resources (*i.e.* do not make excessive requests, or scrape entire sites without permission).\n", "\n", "Because the data is locked up in the code of the web page (a \"beautiful soup\" of HTML, JavaScript, CSS, and other languages), we may also have to apply some elbow grease to pull out the elements in the page code that we want. Fortunately, [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) makes this easy, and we'll see you can almost treat the page code like a searchable database.\n", "\n", "Here we will scrape some text from the excellent online resource [Interpretable Machine Learning](https://christophm.github.io/interpretable-ml-book/), in particular [this page](https://christophm.github.io/interpretable-ml-book/what-is-machine-learning.html). 
Given its simple page structure, this should be relatively straightforward to do. `requests` has already been imported above, and we will now import the `BeautifulSoup` class for use shortly (the library name in python for BeautifulSoup is `bs4`, as it is version 4):" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "L_oyARKmfdmn" }, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "\n", "# Page to scrape\n", "url = 'https://christophm.github.io/interpretable-ml-book/what-is-machine-learning.html'" ] }, { "cell_type": "markdown", "metadata": { "id": "Gx4QKLt5fdmo" }, "source": [ "Next, we make a request for the page using `requests`, just like we did for an API. The difference here is we are hitting a web server, which will return a web page, normally requested by and rendered in a browser:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "OzMCZBq4fdmo" }, "outputs": [], "source": [ "# Make the request\n", "r = requests.get(url)" ] }, { "cell_type": "markdown", "metadata": { "id": "zy9HaNhLfdmp" }, "source": [ "Let's take a look at the first 2000 characters of the result as a string:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 123 }, "executionInfo": { "elapsed": 19, "status": "ok", "timestamp": 1687981820986, "user": { "displayName": "Myles Harrison", "userId": "02225822412878142066" }, "user_tz": 240 }, "id": "RMRj38oxfdmp", "outputId": "4c898de0-dc99-4300-d33c-d39c4fab5ebd" }, "outputs": [ { "data": { "application/vnd.google.colaboratory.intrinsic+json": { "type": "string" }, "text/plain": [ "'\\n\\n\\n\\n \\n \\n 2.2 What Is Machine Learning? | Interpretable Machine Learning\\n \\n \\n\\n \\n \\n \\n \\n \\n\\n \\n \\n \\n \\n \\n\\n\\n\\n\\n\\n\\n \\n \\n \\n \\n \\n\\n\\n\\n\\n\\n\n", "\n", "\n", "\n", "\n", "2.2 What Is Machine Learning? 
| Interpretable Machine Learning\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
2.2 What Is Machine Learning?

\n", "

Machine learning is a set of methods that computers use to make and improve predictions or behaviors based on data.

\n", "

For example, to predict the value of a house, the computer would learn patterns from past house sales.\n", "The book focuses on supervised machine learning, which covers all prediction problems where we have a dataset for which we already know the outcome of interest (e.g. past house prices) and want to learn to predict the outcome for new data.\n", "Excluded from supervised learning are for example clustering tasks (= unsupervised learning) where we do not have a specific outcome of interest, but want to find clusters of data points.\n", "Also excluded are things like reinforcement learning, where an agent learns to optimize a certain reward by acting in an environment (e.g. a computer playing Tetris).\n", "The goal of supervised learning is to learn a predictive model that maps features of the data (e.g. house size, location, floor type, …) to an output (e.g. house price).\n", "If the output is categorical, the task is called classification, and if it is numerical, it is called regression.\n", "The machine learning algorithm learns a model by estimating parameters (like weights) or learning structures (like trees).\n", "The algorithm is guided by a score or loss function that is minimized.\n", "In the house value example, the machine minimizes the difference between the estimated house price and the predicted price.\n", "A fully trained machine learning model can then be used to make predictions for new instances.

\n", "

Estimation of house prices, product recommendations, street sign detection, credit default prediction and fraud detection:\n", "All these examples have in common that they can be solved by machine learning.\n", "The tasks are different, but the approach is the same:
\n", "Step 1: Data collection.\n", "The more, the better.\n", "The data must contain the outcome you want to predict and additional information from which to make the prediction.\n", "For a street sign detector (“Is there a street sign in the image?”), you would collect street images and label whether a street sign is visible or not.\n", "For a credit default predictor, you need past data on actual loans, information on whether the customers were in default with their loans, and data that will help you make predictions, such as income, past credit defaults, and so on.\n", "For an automatic house value estimator program, you could collect data from past house sales and information about the real estate such as size, location, and so on.
\n", "Step 2: Enter this information into a machine learning algorithm that generates a sign detector model, a credit rating model or a house value estimator.
\n", "Step 3: Use model with new data.\n", "Integrate the model into a product or process, such as a self-driving car, a credit application process or a real estate marketplace website.

\n", "

Machines surpass humans in many tasks, such as playing chess (or more recently Go) or predicting the weather.\n", "Even if the machine is as good as a human or a bit worse at a task, there remain great advantages in terms of speed, reproducibility and scaling.\n", "A once implemented machine learning model can complete a task much faster than humans, reliably delivers consistent results and can be copied infinitely.\n", "Replicating a machine learning model on another machine is fast and cheap.\n", "The training of a human for a task can take decades (especially when they are young) and is very costly.\n", "A major disadvantage of using machine learning is that insights about the data and the task the machine solves is hidden in increasingly complex models.\n", "You need millions of numbers to describe a deep neural network, and there is no way to understand the model in its entirety.\n", "Other models, such as the random forest, consist of hundreds of decision trees that “vote” for predictions.\n", "To understand how the decision was made, you would have to look into the votes and structures of each of the hundreds of trees.\n", "That just does not work no matter how clever you are or how good your working memory is.\n", "The best performing models are often blends of several models (also called ensembles) that cannot be interpreted, even if each single model could be interpreted.\n", "If you focus only on performance, you will automatically get more and more opaque models.\n", "\n", "The winning models on machine learning competitions are often ensembles of models or very complex models such as boosted trees or deep neural networks.

\n", "" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Parse the page HTML, naming the parser explicitly\n", "soup = BeautifulSoup(r.text, 'html.parser')\n", "\n", "soup" ] }, { "cell_type": "markdown", "metadata": { "id": "-kBPILKtfdmq" }, "source": [ "That seems more or less the same, but now a little easier to read. What has BeautifulSoup done? It has parsed the raw HTML string into a tree of objects that we can search. Now we can use the `.find` method to pull out individual page elements by [tag, class, or id](https://en.wikipedia.org/wiki/HTML#Elements). Here we will just grab the first paragraph element:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 272, "status": "ok", "timestamp": 1687981821253, "user": { "displayName": "Myles Harrison", "userId": "02225822412878142066" }, "user_tz": 240 }, "id": "kCWyxYwXfdmr", "outputId": "102d5b84-e944-4de0-eb93-8384cb505e8a" }, "outputs": [ { "data": { "text/plain": [ "

Machine learning is a set of methods that computers use to make and improve predictions or behaviors based on data.

" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "soup.find('p')" ] }, { "cell_type": "markdown", "metadata": { "id": "JSXst7ecfdmr" }, "source": [ "Furthermore, we can use `find_all` to return all elements of a given type in the page as an array and iterate over it, and pull out only the text of each paragraph using the `.text` attribute. Let's look at the first 3 paragraphs:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 33, "status": "ok", "timestamp": 1687981821254, "user": { "displayName": "Myles Harrison", "userId": "02225822412878142066" }, "user_tz": 240 }, "id": "wt647SWLfdms", "outputId": "a820e392-80e1-4179-f5a4-ad89e7854539" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Machine learning is a set of methods that computers use to make and improve predictions or behaviors based on data.\n", "For example, to predict the value of a house, the computer would learn patterns from past house sales.\n", "The book focuses on supervised machine learning, which covers all prediction problems where we have a dataset for which we already know the outcome of interest (e.g. past house prices) and want to learn to predict the outcome for new data.\n", "Excluded from supervised learning are for example clustering tasks (= unsupervised learning) where we do not have a specific outcome of interest, but want to find clusters of data points.\n", "Also excluded are things like reinforcement learning, where an agent learns to optimize a certain reward by acting in an environment (e.g. a computer playing Tetris).\n", "The goal of supervised learning is to learn a predictive model that maps features of the data (e.g. house size, location, floor type, …) to an output (e.g. 
house price).\n", "If the output is categorical, the task is called classification, and if it is numerical, it is called regression.\n", "The machine learning algorithm learns a model by estimating parameters (like weights) or learning structures (like trees).\n", "The algorithm is guided by a score or loss function that is minimized.\n", "In the house value example, the machine minimizes the difference between the estimated house price and the predicted price.\n", "A fully trained machine learning model can then be used to make predictions for new instances.\n" ] } ], "source": [ "# Check the first 2 elements\n", "for elem in soup.find_all(\"p\")[0:2]:\n", " print(elem.text)" ] }, { "cell_type": "markdown", "metadata": { "id": "A0dXcuNRfdms" }, "source": [ "Great, now let's join it all together, and replace the newline characters with spaces, to create one giant string of text:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 29, "status": "ok", "timestamp": 1687981821254, "user": { "displayName": "Myles Harrison", "userId": "02225822412878142066" }, "user_tz": 240 }, "id": "8rlbYAkvfdmt", "outputId": "d5c50c6d-354b-4846-e2f9-9a86cec0db74", "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Machine learning is a set of methods that computers use to make and improve predictions or behaviors based on data. For example, to predict the value of a house, the computer would learn patterns from past house sales. The book focuses on supervised machine learning, which covers all prediction problems where we have a dataset for which we already know the outcome of interest (e.g. past house prices) and want to learn to predict the outcome for new data. Excluded from supervised learning are for example clustering tasks (= unsupervised learning) where we do not have a specific outcome of interest, but want to find clusters of data points. 
Also excluded are things like reinforcement learning, where an agent learns to optimize a certain reward by acting in an environment (e.g. a computer playing Tetris). The goal of supervised learning is to learn a predictive model that maps features of the data (e.g. house size, location, floor type, …) to an output (e.g. house price). If the output is categorical, the task is called classification, and if it is numerical, it is called regression. The machine learning algorithm learns a model by estimating parameters (like weights) or learning structures (like trees). The algorithm is guided by a score or loss function that is minimized. In the house value example, the machine minimizes the difference between the estimated house price and the predicted price. A fully trained machine learning model can then be used to make predictions for new instances. Estimation of house prices, product recommendations, street sign detection, credit default prediction and fraud detection: All these examples have in common that they can be solved by machine learning. The tasks are different, but the approach is the same: Step 1: Data collection. The more, the better. The data must contain the outcome you want to predict and additional information from which to make the prediction. For a street sign detector (“Is there a street sign in the image?”), you would collect street images and label whether a street sign is visible or not. For a credit default predictor, you need past data on actual loans, information on whether the customers were in default with their loans, and data that will help you make predictions, such as income, past credit defaults, and so on. For an automatic house value estimator program, you could collect data from past house sales and information about the real estate such as size, location, and so on. Step 2: Enter this information into a machine learning algorithm that generates a sign detector model, a credit rating model or a house value estimator. 
Step 3: Use model with new data. Integrate the model into a product or process, such as a self-driving car, a credit application process or a real estate marketplace website. Machines surpass humans in many tasks, such as playing chess (or more recently Go) or predicting the weather. Even if the machine is as good as a human or a bit worse at a task, there remain great advantages in terms of speed, reproducibility and scaling. A once implemented machine learning model can complete a task much faster than humans, reliably delivers consistent results and can be copied infinitely. Replicating a machine learning model on another machine is fast and cheap. The training of a human for a task can take decades (especially when they are young) and is very costly. A major disadvantage of using machine learning is that insights about the data and the task the machine solves is hidden in increasingly complex models. You need millions of numbers to describe a deep neural network, and there is no way to understand the model in its entirety. Other models, such as the random forest, consist of hundreds of decision trees that “vote” for predictions. To understand how the decision was made, you would have to look into the votes and structures of each of the hundreds of trees. That just does not work no matter how clever you are or how good your working memory is. The best performing models are often blends of several models (also called ensembles) that cannot be interpreted, even if each single model could be interpreted. If you focus only on performance, you will automatically get more and more opaque models. The winning models on machine learning competitions are often ensembles of models or very complex models such as boosted trees or deep neural networks. 
\n" ] } ], "source": [ "text = \"\"\n", "\n", "for paragraph in soup.find_all(\"p\"):\n", " text += paragraph.text.replace('\\n', ' ') + ' '\n", "\n", "print(text)" ] }, { "cell_type": "markdown", "metadata": { "id": "SndgFOaufdmt" }, "source": [ "We now have scraped text data from a website! Doing so for more complicated pages, or programmatically over many pages, can be accomplished with more code and by inspecting the structure of the different pages. The difficulty or ease of doing so will depend upon how the site is structured and on the page code." ] }, { "cell_type": "markdown", "metadata": { "id": "Gz9_m9W5fdmt" }, "source": [ "## Data Preprocessing\n", "\n", "We have now acquired some text. Before using this text in an ML application of NLP, we first need to preprocess the data.\n", "\n", "As outlined in the slides, major steps in preprocessing text are:\n", "- Normalization (addressing case, removing punctuation and stop words, stemming or lemmatization)\n", "- Tokenization (breaking up into individual units of language, usually words)\n", "- Vectorization (converting tokens to structured numeric data)" ] }, { "cell_type": "markdown", "metadata": { "id": "UM5xq1twfdmu" }, "source": [ "### Normalization\n", "\n", "There are a few things we need to do here: *addressing case, removing punctuation, and stemming or lemmatization*. For simplicity's sake, we will not expand contractions (don't, won't, can't, etc.), though this would be another normalization step. 
We will also only use the simpler technique of stemming, though there are lemmatizers built into packages such as [nltk](https://www.nltk.org/_modules/nltk/stem/wordnet.html) and [TextBlob](https://textblob.readthedocs.io/en/dev/quickstart.html#words-inflection-and-lemmatization).\n", "\n", "To standardize the case, we simply convert everything to lowercase:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 70 }, "executionInfo": { "elapsed": 26, "status": "ok", "timestamp": 1687981821255, "user": { "displayName": "Myles Harrison", "userId": "02225822412878142066" }, "user_tz": 240 }, "id": "OQpIAHZifdmu", "outputId": "b8dbccea-a218-4f13-c470-5aafd9bd2e2b" }, "outputs": [ { "data": { "application/vnd.google.colaboratory.intrinsic+json": { "type": "string" }, "text/plain": [ "'machine learning is a set of methods that computers use to make and improve predictions or behaviors based on data. for example, to predict the value of a house, the computer would learn patterns from past house sales. the book focuses on supervised machine learning, which covers all prediction problems where we have a dataset for which we already know the outcome of interest (e.g.\\xa0past house prices) and want to learn to predict the outcome for new data. excluded from supervised learning are for'" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Convert to lower case\n", "text = text.lower()\n", "text[0:500]" ] }, { "cell_type": "markdown", "metadata": { "id": "i-5ifOJkfdmv" }, "source": [ "It appears there are also some special (non-ASCII) Unicode characters mixed in there, which can cause problems in later steps. Dealing with special characters and different text encodings can be one of the challenging parts of doing NLP. 
We will change the encoding to ASCII to address these:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 70 }, "executionInfo": { "elapsed": 25, "status": "ok", "timestamp": 1687981821256, "user": { "displayName": "Myles Harrison", "userId": "02225822412878142066" }, "user_tz": 240 }, "id": "lpDjQLslfdmv", "outputId": "3fc86e99-db04-4b26-ea58-8ef5949912bb" }, "outputs": [ { "data": { "application/vnd.google.colaboratory.intrinsic+json": { "type": "string" }, "text/plain": [ "'machine learning is a set of methods that computers use to make and improve predictions or behaviors based on data. for example, to predict the value of a house, the computer would learn patterns from past house sales. the book focuses on supervised machine learning, which covers all prediction problems where we have a dataset for which we already know the outcome of interest (e.g.past house prices) and want to learn to predict the outcome for new data. excluded from supervised learning are for '" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "text = text.encode('ASCII', errors='ignore').decode()\n", "text[0:500]" ] }, { "cell_type": "markdown", "metadata": { "id": "Vu-77Ypsfdmw" }, "source": [ "That's better, we can see the special characters like `\\xa0` that were present before are now gone. Next we will remove all punctuation. 
Fortunately, all the punctuation characters are contained in a string stored in the python `string` base module:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 23, "status": "ok", "timestamp": 1687981821257, "user": { "displayName": "Myles Harrison", "userId": "02225822412878142066" }, "user_tz": 240 }, "id": "VYJRB8qmfdm3", "outputId": "393bd80c-db69-40b3-d758-9984125e24fc" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~\n" ] } ], "source": [ "from string import punctuation\n", "\n", "print(punctuation)" ] }, { "cell_type": "markdown", "metadata": { "id": "VaTLP6n1fdm3" }, "source": [ "We can now iterate over each punctuation character, and update the text, replacing it with the empty string `''`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "ppvgUAKMfdm4" }, "outputs": [], "source": [ "for punctuation_mark in punctuation:\n", " text = text.replace(punctuation_mark, '')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 122 }, "executionInfo": { "elapsed": 18, "status": "ok", "timestamp": 1687981821258, "user": { "displayName": "Myles Harrison", "userId": "02225822412878142066" }, "user_tz": 240 }, "id": "xRzf8fczfdm4", "outputId": "7d18514e-75bd-4923-faf7-ff229a544e55" }, "outputs": [ { "data": { "application/vnd.google.colaboratory.intrinsic+json": { "type": "string" }, "text/plain": [ "'machine learning is a set of methods that computers use to make and improve predictions or behaviors based on data for example to predict the value of a house the computer would learn patterns from past house sales the book focuses on supervised machine learning which covers all prediction problems where we have a dataset for which we already know the outcome of interest egpast house prices and want to learn to 
predict the outcome for new data excluded from supervised learning are for example clustering tasks unsupervised learning where we do not have a specific outcome of interest but want to find clusters of data points also excluded are things like reinforcement learning where an agent learns to optimize a certain reward by acting in an environment ega computer playing tetris the goal of supervised learning is to learn a predictive model that maps features of the data eghouse size location floor type to an output eghouse price if the output is categorical the task is called classi'" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Check the result\n", "text[0:1000]" ] }, { "cell_type": "markdown", "metadata": { "id": "-f_rCGu2fdm5" }, "source": [ "We can already see some strange things are happening, such as the *e.g.* getting folded into the words \"past\" and \"house\" to create the tokens \"egpast\" and \"eghouse\". Preprocessing text is not an exact science... we will proceed as is for now, though there perhaps could have been better ways to tokenize or deal with problematic portions of this text such as abbreviations." ] }, { "cell_type": "markdown", "metadata": { "id": "CcBunenJfdm5" }, "source": [ "### Stemming\n", "Stemming and lemmatization are built into the very powerful [nltk toolkit](https://www.nltk.org/). 
Here we choose to do simple stemming using the `SnowballStemmer` (a particular type of stemming algorithm):" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 36 }, "executionInfo": { "elapsed": 1878, "status": "ok", "timestamp": 1687981823119, "user": { "displayName": "Myles Harrison", "userId": "02225822412878142066" }, "user_tz": 240 }, "id": "Jco34Xp-fdm5", "outputId": "4ef2482d-d309-499e-e025-ea43cc0bd6bf" }, "outputs": [ { "data": { "application/vnd.google.colaboratory.intrinsic+json": { "type": "string" }, "text/plain": [ "'run'" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from nltk.stem import SnowballStemmer\n", "\n", "# Instantiate stemmer for English\n", "sbstem = SnowballStemmer('english')\n", "\n", "# Check\n", "sbstem.stem('running')" ] }, { "cell_type": "markdown", "metadata": { "id": "lC5socchfdm6" }, "source": [ "Stemmers in `nltk` operate on individual tokens, so we must iterate over the words in the freeform text, then join everything back together again:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "bdgtcAYLfdm6" }, "outputs": [], "source": [ "# Create an empty list for the stemmed words\n", "stemmed_words = list()\n", "\n", "# Iterate over each word, stem it, and add it to the list\n", "for word in text.split(' '):\n", " stemmed_words.append(sbstem.stem(word))\n", "\n", "# Join it all back together and strip any leading or trailing whitespace\n", "text = ' '.join(stemmed_words).strip()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 70 }, "executionInfo": { "elapsed": 255, "status": "ok", "timestamp": 1687981823368, "user": { "displayName": "Myles Harrison", "userId": "02225822412878142066" }, "user_tz": 240 }, "id": "XspuoPN1fdm7", "outputId": "e8f59f05-cbea-40f1-ef11-0f9bd23ab11e" }, "outputs": [ { "data": { 
"application/vnd.google.colaboratory.intrinsic+json": { "type": "string" }, "text/plain": [ "'machin learn is a set of method that comput use to make and improv predict or behavior base on data for exampl to predict the valu of a hous the comput would learn pattern from past hous sale the book focus on supervis machin learn which cover all predict problem where we have a dataset for which we alreadi know the outcom of interest egpast hous price and want to learn to predict the outcom for new data exclud from supervis learn are for exampl cluster task unsupervis learn where we do not hav'" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Check\n", "text[0:500]" ] }, { "cell_type": "markdown", "metadata": { "id": "N2RUv2R1fdm7" }, "source": [ "We can see that words like *machine* have been stemmed to *machin*, and *improve* to *improv*, and so on, so we appear to have applied stemming correctly. This is enough normalization for now, and we can move on to tokenizing and vectorizing our text." ] }, { "cell_type": "markdown", "metadata": { "id": "PUmbRzM8fdm7" }, "source": [ "### Tokenization" ] }, { "cell_type": "markdown", "metadata": { "id": "gYEpKgSFfdm8" }, "source": [ "Our approach for tokenization could be as simple as splitting on whitespace. As we saw above, we actually did this step and then undid it, as the `nltk` stemmer works on individual words. 
Alternatively, we could have applied stemming *after* or as part of tokenization, as the steps in text preprocessing do not necessarily follow a fixed order and depend upon the implementation.\n", "\n", "Splitting on whitespace is as simple as using the `.split()` method built into every Python string:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 21, "status": "ok", "timestamp": 1687981823370, "user": { "displayName": "Myles Harrison", "userId": "02225822412878142066" }, "user_tz": 240 }, "id": "SxB3oiIzfdm8", "outputId": "744f5407-f1b3-444e-f9ef-2557d4cc474d" }, "outputs": [ { "data": { "text/plain": [ "['machin', 'learn', 'is', 'a', 'set', 'of', 'method', 'that', 'comput', 'use']" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Tokenize and show first 10 tokens\n", "text.split(' ')[0:10]" ] }, { "cell_type": "markdown", "metadata": { "id": "j-9ONzIzfdm8" }, "source": [ "More sophisticated approaches for tokenization exist. We actually do not need to do this step manually, as it is included in the vectorization step in `scikit-learn`, as we will see below." ] }, { "cell_type": "markdown", "metadata": { "id": "NXJ6Hy-gfdm9" }, "source": [ "### Vectorization\n", "\n", "There are two standard types of vectorization used in traditional NLP: *count vectorization* and *term frequency - inverse document frequency (tf-idf)* vectorization. 
Binary (\"One-hot\") encoding with a boolean (0/1) flag for word occurrence in each document can also be done, though this is less common.\n", "\n", "The first two vectorization methods are implemented in `scikit-learn` in the `feature_extraction.text` submodule, and we can apply them as below:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "RYndP7OUfdm9" }, "outputs": [], "source": [ "from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer\n", "\n", "# Instantiate, fit and transform - Count vectorization\n", "cv = CountVectorizer()\n", "count_vectorized = cv.fit_transform([text])\n", "\n", "# Instantiate, fit and transform - TF-IDF vectorization\n", "tfidf = TfidfVectorizer()\n", "tfidf_vectorized = tfidf.fit_transform([text])" ] }, { "cell_type": "markdown", "metadata": { "id": "31Gqguvzfdm-" }, "source": [ "Let's take a look at the outputs:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 18, "status": "ok", "timestamp": 1687981823372, "user": { "displayName": "Myles Harrison", "userId": "02225822412878142066" }, "user_tz": 240 }, "id": "YZ7CZEbHfdm_", "outputId": "7a788dff-3eb2-462c-db07-b0044aed8110" }, "outputs": [ { "data": { "text/plain": [ "<1x263 sparse matrix of type '<class 'numpy.int64'>'\n", "\twith 263 stored elements in Compressed Sparse Row format>" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "count_vectorized" ] }, { "cell_type": "markdown", "metadata": { "id": "MrvFeFjTfdm_" }, "source": [ "This is a sparse matrix - the *document-term matrix* - with each feature (column) being a token that appeared in the documents (rows), so there are 263 unique tokens in our single document of text. 
This can be cast into a dense array and we can pass the tokens as the column names using pandas:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 110 }, "executionInfo": { "elapsed": 260, "status": "ok", "timestamp": 1687981823617, "user": { "displayName": "Myles Harrison", "userId": "02225822412878142066" }, "user_tz": 240 }, "id": "smi41HIHfdnA", "outputId": "bb5805f6-48fd-44d6-c26a-69b350b50b3e" }, "outputs": [ { "data": { "text/html": [ "\n", "
\n", " " ], "text/plain": [ " about act actual addit advantag agent algorithm all alreadi also \\\n", "0 2 1 1 1 1 1 3 2 1 2 \n", "\n", " ... which will win with work wors would you young your \n", "0 ... 3 2 1 2 2 1 3 10 1 1 \n", "\n", "[1 rows x 263 columns]" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "count_df = pd.DataFrame(count_vectorized.todense(), columns=cv.get_feature_names_out())\n", "\n", "count_df.head()" ] }, { "cell_type": "markdown", "metadata": { "id": "EXWOdb3WfdnB" }, "source": [ "Great, we now have a count of the number of occurrences of each token in our text!\n", "\n", "Contrast this with the floating point numbers from tf-idf vectorization:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 174 }, "executionInfo": { "elapsed": 11, "status": "ok", "timestamp": 1687981823618, "user": { "displayName": "Myles Harrison", "userId": "02225822412878142066" }, "user_tz": 240 }, "id": "0t5xrZrUfdnB", "outputId": "b9a6d8c9-3efd-4c8e-f9b3-05b86efb0269" }, "outputs": [ { "data": { "text/html": [ "\n", "
\n", " " ], "text/plain": [ " about act actual addit advantag agent algorithm \\\n", "0 0.02557 0.012785 0.012785 0.012785 0.012785 0.012785 0.038355 \n", "\n", " all alreadi also ... which will win with \\\n", "0 0.02557 0.012785 0.02557 ... 0.038355 0.02557 0.012785 0.02557 \n", "\n", " work wors would you young your \n", "0 0.02557 0.012785 0.038355 0.127848 0.012785 0.012785 \n", "\n", "[1 rows x 263 columns]" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tfidf_df = pd.DataFrame(tfidf_vectorized.todense(), columns=tfidf.get_feature_names_out())\n", "\n", "tfidf_df.head()" ] }, { "cell_type": "markdown", "metadata": { "id": "qoTH7PHYfdnC" }, "source": [ "From our count vectorization, we now have a count of each token. Normally our document-term matrix would have many rows, one for each document in our corpus. Nonetheless, we can find the most frequently occurring tokens in the text we scraped by doing a simple sort and making a bar chart:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 472 }, "executionInfo": { "elapsed": 538, "status": "ok", "timestamp": 1687981824147, "user": { "displayName": "Myles Harrison", "userId": "02225822412878142066" }, "user_tz": 240 }, "id": "lF_Wx2UifdnC", "outputId": "81999c62-a024-4aae-a2c5-b13fcb42467e" }, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import matplotlib.pyplot as plt\n", "count_df.T[0].sort_values().tail(10).plot(kind='barh')\n", "plt.xlabel('Count of occurrence')\n", "plt.ylabel('Token')\n", "plt.title('Most Frequently Occurring Tokens')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": { "id": "LJcUg2k-fdnC" }, "source": [ "We can see that many of the most frequently occurring tokens are [stop words](https://en.wikipedia.org/wiki/Stop_word). We could have dealt with these as part of preprocessing. Fortunately for us, however, basic stop word removal is built into the vectorizers in scikit-learn, so this can be done at the same time as vectorization.\n", "\n", "So we will take a step back here and re-vectorize our data, while removing stop words, then replot the result:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 472 }, "executionInfo": { "elapsed": 309, "status": "ok", "timestamp": 1687981824453, "user": { "displayName": "Myles Harrison", "userId": "02225822412878142066" }, "user_tz": 240 }, "id": "HcjqddW7fdnD", "outputId": "6e611d20-6e7f-4675-9afe-bbe59ac588b2" }, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer\n", "\n", "# Instantiate, fit and transform - Count vectorization\n", "cv = CountVectorizer(stop_words='english')\n", "count_vectorized = cv.fit_transform([text])\n", "\n", "# Recreate the dataframe\n", "count_df = pd.DataFrame(count_vectorized.todense(), columns=cv.get_feature_names_out())\n", "\n", "# Visualize again\n", "count_df.T[0].sort_values().tail(10).plot(kind='barh')\n", "plt.xlabel('Count of occurrence')\n", "plt.ylabel('Token')\n", "plt.title('Most Frequently Occurring Tokens')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": { "id": "gI8wviJdfdnD" }, "source": [ "This seems to be more representative, given what we know about the topic of the original text. Without doing any machine learning, we have done some basic text analytics (content analysis), given we now have structured data!\n", "\n", "Here we have performed the basics of the preprocessing steps for text on a single document. In practice, this would be done over a very large corpus of many documents, and the document-term matrix produced from vectorization can grow quite large." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "\n", "\n", " \n", " \n", " \n", " \n", "\n", "

Copyright NLP from scratch, 2024.

" ] } ], "metadata": { "colab": { "provenance": [ { "file_id": "https://github.com/mylesmharrison/nlp4free/blob/master/notebooks/NLP4Free_Part2_DataAcquisitionandPreprocessing.ipynb", "timestamp": 1687980773112 } ] }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.6" } }, "nbformat": 4, "nbformat_minor": 4 }